For this project, we looked at Palworld, an open-world video game where players can collect and battle with various creatures known as Pals. In the game, Pals possess unique attributes and abilities that allow them to perform work, fight, or produce tools for the player to utilize on their journey. In Palworld, there is a feature called rarity that distinguishes Pals based on their strength or availability within the world. With the diverse range of Pals from Common to Legendary, understanding the key determinants of a Pal's rarity is crucial for players aiming to optimize their gameplay experience. So as a team, we aim to analyze the key features that make Pals rare.
In our analysis, we aim to explore the relationship between a Pal's rarity and specific features such as Melee Attack, HP, Defense, Shot Attack, Stamina, Price, and Support. Therefore, our research question is as follows: which features have the most impact in determining the rarity of a Pal? The primary goal of this project is to leverage the scraped data to accurately classify the rarity level of each Pal using their numerical statistics. By developing and implementing machine learning models, we aim to predict Pal rarity with precision. Additionally, we seek to thoroughly evaluate and understand the performance of these models by generating confusion matrices, accuracy scores, and feature importance plots. An essential aspect of this analysis is identifying and visualizing the most significant features that contribute to the classification of Pal rarity, providing deeper insights into the factors that influence these predictions. By understanding the relationship between these features and rarity, we hope to provide valuable insight into different Pals' potential and value to players of the game.
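As a rough illustration of the feature-importance output we aim for, here is a minimal sketch on synthetic stand-in data. This is not our real scrape; the column names and the random data are placeholders, and the snippet only shows how a random forest exposes per-feature importance scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 "Pals" with 3 numeric stats and a rarity label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Make the first column ("HP") strongly predictive of rarity.
y = np.where(X[:, 0] > 0, "Rare", "Common")

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ gives one score per column; the scores sum to 1.
for name, score in zip(["HP", "Defense", "Melee Attack"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```

Because the label is derived from the first column, its importance score should dominate; the same mechanism later lets us rank HP, Defense, Shot Attack, and the other stats.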
For this project we scraped data from Palworld.gg to get all of our information.
In our data collection, we grabbed the following data:
We did not end up having to clean much of our data; we only needed to strip non-alphanumeric characters, digits, and spaces from each Pal's name.
Our numerical data was scaled with StandardScaler from the sklearn module.
Our primary target values in this research are Rarity and Level, since they directly relate to our research question on rarity. In our data, Level ranges from 1 to 20, and Rarity takes one of four values: Common, Rare, Epic, or Legendary.
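Since the four rarity labels are ordered, one way to make them usable by numeric models is an ordered pandas Categorical. This is a minimal sketch on a hypothetical mini-frame (the names shown are just examples, not the full dataset):

```python
import pandas as pd

# Hypothetical mini-frame mirroring our scraped columns.
df = pd.DataFrame({"name": ["Arsox", "Azurobe", "Anubis", "Jetragon"],
                   "rarity": ["Common", "Rare", "Epic", "Legendary"]})

# Encode rarity as an ordered categorical: Common < Rare < Epic < Legendary.
rarity_order = ["Common", "Rare", "Epic", "Legendary"]
df["rarity"] = pd.Categorical(df["rarity"], categories=rarity_order, ordered=True)

# .cat.codes yields integers 0-3 that preserve the ordering for plots and models.
print(df["rarity"].cat.codes.tolist())  # [0, 1, 2, 3]
```

The same `rarity_order` list is what we pass to `category_orders` when plotting box plots by rarity later on.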
For all of our other features that we are studying (Speed, Price, Attack, HP, Defense, etc.), they are stats of a Pal that help determine their worth in battle, crafting, or capabilities within the world. As mentioned before, we assume that these features will help determine the rarity of each given Pal and that their general stats will increase as the rarity increases.
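The assumption that stats rise with rarity can be spot-checked with a simple correlation against level. A minimal sketch on hypothetical numbers (not our real stats) looks like this:

```python
import pandas as pd

# Hypothetical mini-frame: do stats rise with level, as we assume?
df = pd.DataFrame({"level": [1, 3, 5, 7, 9],
                   "HP": [70, 80, 95, 110, 130],
                   "Support": [100, 90, 100, 110, 95]})

# Pearson correlation of each stat with level; values near +1 support the assumption.
print(df.corr(numeric_only=True)["level"].round(2))
```

A stat like the monotonically increasing `HP` column here correlates strongly with level, while a flat stat like `Support` does not.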
One potential problem with the data is that there was a dataset we did not collect that could have been important: the Crafting Levels of each Pal. These provide valuable insight into a Pal's work capabilities, which in turn could raise its rarity. We opted not to collect them because extracting and recompiling that data was tedious for a factor that could have been its own project to compare against rarity. We also removed the Crafting Speed column because its value is identical for every Pal. Another slight concern arose in our regression assumption check for normality: the discrete data we used does not fit any regression well, and only classification works; we go into more depth later in the interpretation.
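Dropping a column like Crafting Speed, which is identical across all Pals, can be done programmatically rather than by inspection. A minimal sketch on hypothetical data, using `nunique` to find constant columns:

```python
import pandas as pd

# Hypothetical frame: 'Crafting Speed' is constant, like in our scrape.
df = pd.DataFrame({"HP": [120, 85, 100],
                   "Crafting Speed": [100, 100, 100]})

# Columns with a single unique value carry no signal for classification.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['Crafting Speed']
```

This check also guards against other columns silently becoming constant after filtering.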
# Put all the module import in this code chunk
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
import re
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import pylab as py
import plotly.express as px
import scipy.stats as stats
from scipy.stats import linregress
import statsmodels.api as sm
from sklearn.cluster import AgglomerativeClustering
extract_soup(url)
Extracts specific data (like names and URLs) from a web page.
get_pal_details(pal_url)
Fetches details of a specific Pal using the provided URL.
get_all_pals_details(href_list)
Retrieves details for all Pals based on the list of href links.
def extract_soup(url):
""" Extracts Pal names and href links from the given URL.
Args:
url (str): url of website in string format
Returns:
name_list (list): list of pal names
href_list (list): list of pal links
"""
# Get the HTML content from the provided URL
html = requests.get(url).text
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Find all Pal link elements
pals_links = soup.find_all("a", class_="")
# Initialize lists to store names and href links
names_list = []
href_list = []
# Extract names and href links
for link in pals_links:
name_div = link.find("div", class_="name")
if name_div:
name = name_div.text.strip().lower()
href = link['href']
cleaned_name = re.sub(r'[^\w\s]', "", name) # Remove non-alphanumeric characters
            cleaned_name = re.sub(r'\d+', "", cleaned_name)  # Remove digits (raw string avoids an invalid-escape warning)
cleaned_name = cleaned_name.replace(" ", "") # Remove spaces
names_list.append(cleaned_name)
href_list.append(href)
# Return the cleaned names and href links
return names_list, href_list
def get_pal_details(pal_url):
""" Fetches details of a Pal from the provided URL.
Args:
pal_url (str): url of individual pal
Returns:
stats_dict (dict): the name, level, rarity, and type stats of individual pal
"""
try:
response = requests.get(pal_url)
if response.status_code == 404:
raise ValueError("404 Not Found")
pal_html = response.text
pal_soup = BeautifulSoup(pal_html, 'html.parser')
# Get the name of the Pal
pal_name_tag = pal_soup.find("h1")
if pal_name_tag:
pal_name = pal_name_tag.text.strip()
else:
raise ValueError("Name not found")
# Get the rarity and level of the Pal
rarity_classes = ["epic rarity", "legendary rarity", "rare rarity", "common rarity"]
pal_rarity = None
pal_level = None
for rarity_class in rarity_classes:
rarity_div = pal_soup.find("div", class_=rarity_class)
if rarity_div:
level_tag = rarity_div.find("div", class_="lv")
pal_level = int(re.sub(r'\D', '', level_tag.text.strip())) if level_tag else None
rarity_tag = rarity_div.find("div", class_="name")
pal_rarity = rarity_tag.text.strip() if rarity_tag else None
break
# Get the type (element) of the Pal
elements_div = pal_soup.find("div", class_="elements")
element_list = []
if elements_div:
element_items = elements_div.find_all("div", class_="element")
element_list = [element.find("div", class_="name").text.strip() for element in element_items]
# Get the stats of the Pal
stats_div = pal_soup.find("div", class_="stats")
stats_dict = {}
if stats_div:
stats_items = stats_div.find_all("div", class_="item")
for stat in stats_items:
stat_name_tag = stat.find("div", class_="name")
stat_value_tag = stat.find("div", class_="value")
if stat_name_tag and stat_value_tag:
stat_name = stat_name_tag.text.strip()
stat_value = int(stat_value_tag.text.strip()) # Convert to integer
stats_dict[stat_name] = stat_value
return {
'name': pal_name,
'rarity': pal_rarity,
'level': pal_level,
**stats_dict, # Unpack the stats_dict into the return dictionary
'types': element_list
}
    except Exception as e:
        print(f"Error processing {pal_url}: {e}")
        # stats_dict may never have been assigned if the request failed,
        # so omit the stat keys and let pandas fill the missing columns with NaN
        return {
            'name': "404 Not Found",
            'rarity': None,
            'level': None,
            'types': []
        }
def get_all_pals_details(href_list):
""" Retrieves details for all Pals based on the list of href links.
Args:
href_list (list): list of pal href links
Returns:
        all_pals_details (list): list of all pal details
"""
base_url = "https://palworld.gg"
all_pals_details = []
for href in href_list:
pal_url = base_url + href
pal_details = get_pal_details(pal_url)
all_pals_details.append(pal_details)
return all_pals_details
# URL of the Pals page
pals_stats_url = "https://palworld.gg/pals"
# Extract names and href links using the function
names_list, href_list = extract_soup(pals_stats_url)
# Get details for all Pals
all_pals_details = get_all_pals_details(href_list)
# Convert the collected data into a DataFrame
df = pd.DataFrame(all_pals_details)
# Display the DataFrame
df.head()
| name | rarity | level | HP | Defense | Crafting Speed | Melee Attack | Shot Attack | Price | Stamina | Support | Running Speed | Sprinting Speed | Slow Walk Speed | types | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Anubis | Epic | 10 | 120 | 100 | 100 | 130 | 130 | 4960 | 100 | 100 | 800 | 1000 | 80 | [Earth] |
| 1 | Arsox | Common | 4 | 85 | 95 | 100 | 100 | 95 | 3520 | 100 | 100 | 600 | 800 | 87 | [Fire] |
| 2 | Astegon | Epic | 9 | 100 | 125 | 100 | 100 | 125 | 8200 | 300 | 100 | 600 | 800 | 100 | [Dragon, Dark] |
| 3 | Azurobe | Rare | 7 | 110 | 100 | 100 | 70 | 100 | 5600 | 100 | 100 | 600 | 800 | 75 | [Water, Dragon] |
| 4 | Beakon | Rare | 6 | 105 | 80 | 100 | 100 | 115 | 7490 | 160 | 100 | 750 | 1200 | 100 | [Electricity] |
csv_path = 'all_pals_info.csv'
df.to_csv(csv_path, index=False)
df = pd.read_csv('all_pals_info.csv')
# Specify the columns to scale (excluding 'level')
columns_to_scale = ['HP', 'Defense', 'Crafting Speed', 'Melee Attack', 'Price',
'Shot Attack', 'Stamina', 'Support', 'Running Speed',
'Sprinting Speed', 'Slow Walk Speed']
# Initialize the StandardScaler
scaler = StandardScaler()
# Scale the specified columns
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
# Save the scaled dataset to a new CSV file
scaled_csv_path = 'scaled_all_pals_info.csv'
df.to_csv(scaled_csv_path, index=False)
df_pal = pd.read_csv('scaled_all_pals_info.csv')
df_pal.head()
| name | rarity | level | HP | Defense | Crafting Speed | Melee Attack | Shot Attack | Price | Stamina | Support | Running Speed | Sprinting Speed | Slow Walk Speed | types | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Anubis | Epic | 10 | 1.280843 | 0.525005 | 0.0 | 1.853478 | 1.731345 | 0.179085 | -0.291512 | 0.026534 | 1.010247 | 0.739855 | 0.054645 | ['Earth'] |
| 1 | Arsox | Common | 4 | -0.465484 | 0.262503 | 0.0 | 0.242327 | 0.041222 | -0.319997 | -0.291512 | 0.026534 | 0.152021 | 0.139879 | 0.193435 | ['Fire'] |
| 2 | Astegon | Epic | 9 | 0.282942 | 1.837518 | 0.0 | 0.242327 | 1.489899 | 1.302020 | 4.054663 | 0.026534 | 0.152021 | 0.139879 | 0.451187 | ['Dragon', 'Dark'] |
| 3 | Azurobe | Rare | 7 | 0.781892 | 0.525005 | 0.0 | -1.368823 | 0.282669 | 0.400899 | -0.291512 | 0.026534 | 0.152021 | 0.139879 | -0.044490 | ['Water', 'Dragon'] |
| 4 | Beakon | Rare | 6 | 0.532417 | -0.525005 | 0.0 | 0.242327 | 1.007007 | 1.055945 | 1.012341 | 0.026534 | 0.795690 | 1.339831 | 0.451187 | ['Electricity'] |
plot_feature_means_by_level(df, x_feat_list)
Plots the mean values of specified features by level.
plot_hp_distribution_by_rarity(csv_file)
Creates a box plot to visualize the distribution of health among rarities.
plot_price_vs_level_with_regression(csv_file)
Plots the price of a Pal against their rarity level with a regression line.
def plot_feature_means_by_level(df, x_feat_list):
""" Plots the mean values of specified features by level.
Args:
df (DataFrame): data frame containing the pals data.
x_feat_list (list): List of features to plot.
Returns:
None: line plot of level vs mean value of pal features
"""
# Define the range of levels to include in the plot
levels_range = (1, 10)
# Define the title of the plot
title = 'Mean Values of Features by Level (Excluding Price)'
# Calculate means for each feature by level
means_by_level = df.groupby('level')[x_feat_list].mean()
# Filter levels to be within the specified range
filtered_levels = range(levels_range[0], levels_range[1] + 1)
means_by_level_filtered = means_by_level.loc[filtered_levels]
# Plot the features
plt.figure(figsize=(10, 6))
    # plt.colormaps replaces the deprecated plt.cm.get_cmap (Matplotlib >= 3.7)
    colors = plt.colormaps['tab10'].resampled(len(x_feat_list))
for i, feature in enumerate(x_feat_list):
plt.plot(means_by_level_filtered.index, means_by_level_filtered[feature],
label=f'{feature}', linestyle='-', marker='o', color=colors(i))
# Adding labels, title, and legend
plt.xlabel('Level')
plt.ylabel('Value')
plt.title(title)
plt.legend(loc='upper left')
plt.show()
x_feat_list = ['HP', 'Defense', 'Melee Attack',
'Shot Attack', 'Stamina', 'Support', 'Running Speed',
'Sprinting Speed', 'Slow Walk Speed']
# Assuming df_pal is your DataFrame
plot_feature_means_by_level(df_pal, x_feat_list)
This graph represents the mean values of various features plotted against different levels, excluding price. Each line in the graph corresponds to a different feature, and the y-axis indicates the value of these features, while the x-axis represents the levels.
From the graph, it is evident that some features show a clear upward trend as the level increases, indicating a positive correlation with level. For example, Defense, HP, and Shot Attack display a consistent increase in their mean values as the level rises. This suggests that higher levels are associated with greater values for these features, implying that they are positively correlated with the level. Notably, Shot Attack has a sharp increase towards the higher levels, making it one of the most strongly correlated features with the level.
Conversely, some features, such as Slow Walk Speed, show a decreasing trend as the level increases. This negative correlation suggests that as the level goes up, the mean value of Slow Walk Speed tends to decrease. Additionally, features like Sprinting Speed and Running Speed have more erratic patterns, with some fluctuations up and down, indicating less consistent relationships with the level.
In terms of correlation strength, Defense, HP, and Shot Attack appear to have the most prominent and direct positive relationship with the level. Their steady upward trajectories make them key indicators of higher levels, as their values increase consistently across the levels. This aligns with the coefficient values previously discussed, where these features had the highest coefficients, reflecting their significant impact on the target variable.
Overall, this graph provides valuable insights into how different features behave across levels, highlighting those that are more strongly correlated with the level, either positively or negatively.
def plot_hp_distribution_by_rarity(csv_file):
""" Creates an interactive box plot to visualize the distribution of Health
Points (HP) across different rarity levels of Pals.
Args
csv_file (str): The path to the CSV file containing the Pals data. The CSV
file must contain columns 'rarity', 'HP', 'name', and 'types'.
Returns:
None : Displays an interactive plot showing the distribution of HP by
rarity, which also includes outliers and hover information for each data point.
"""
# Example: plot_hp_distribution_by_rarity('all_pals_info.csv')
# Load the data
df_pals = pd.read_csv(csv_file)
# Define the custom order for the rarity categories
rarity_order = ['Common', 'Rare', 'Epic', 'Legendary']
# Create the interactive box plot
fig = px.box(df_pals, x='rarity', y='HP', points='outliers',
hover_data=['name', 'rarity', 'types'],
category_orders={'rarity': rarity_order},
title='HP Distribution by Rarity',
labels={'rarity': 'Rarity', 'HP': 'Health Points (HP)'})
# Show the plot
fig.update_layout(
font=dict(
size=18,
color="black"))
fig.show()
plot_hp_distribution_by_rarity('all_pals_info.csv')
This box plot illustrates the distribution of Health Points (HP) across different rarity levels of Pals: Common, Rare, Epic, and Legendary. The x-axis represents the rarity levels, while the y-axis shows the HP values.
Key observations include the median HP, which increases from Common to Legendary Pals, suggesting that HP tends to rise with rarity. The Interquartile Range (IQR) is narrow for Common Pals, indicating that their HP values are closely clustered. In contrast, higher rarity levels like Epic and Legendary show more consistent HP values, with a tighter IQR. Notably, there are outliers in the Rare category, with some Pals having unusually high or low HP values, indicating greater variability.
Overall, this box plot reveals that higher rarity levels are generally associated with higher and more consistent HP values, suggesting that HP could be a significant indicator of a Pal's rarity. The variability in lower rarity levels, coupled with the presence of outliers, points to a greater diversity in HP among Common and Rare Pals.
def plot_price_vs_level_with_regression(csv_file, exclude_rarity='Legendary', label_size=18):
""" Creates an interactive scatterplot of Price vs Level, excluding a specified rarity,
and overlays a line of best fit derived from linear regression. The plot includes
hover information and is color-coded by rarity.
Args:
csv_file (str): The path to the CSV file containing the Pals data. The CSV
file must contain columns 'Price', 'level', 'rarity', 'name', and 'types'.
exclude_rarity (str): (optional) The rarity level to exclude from the plot(default is 'Legendary').
label_size (int): (optional) The font size for the plot labels (default is 18).
Returns:
None: Displays an interactive scatterplot with a line of best fit and prints
the linear regression results including the slope, intercept, and R-squared value.
"""
# Example: plot_price_vs_level_with_regression('all_pals_info.csv')
# Load the data
df_pals = pd.read_csv(csv_file)
# Exclude specified rarity
    # .copy() avoids the SettingWithCopyWarning when adding the best-fit column below
    df_pals_filtered = df_pals[df_pals['rarity'] != exclude_rarity].copy()
# Perform linear regression to find the line of best fit
slope, intercept, r_value, p_value, std_err = linregress(df_pals_filtered['Price'], df_pals_filtered['level'])
# Calculate the line of best fit
df_pals_filtered['Best Fit Line'] = intercept + slope * df_pals_filtered['Price']
# Create the scatterplot with hover information, color-coded by rarity
fig = px.scatter(df_pals_filtered, x='Price', y='level',
color='rarity',
hover_data={'name': True, 'rarity': True, 'types': True},
title=f'Scatterplot of Price vs Level (Excluding {exclude_rarity}) with Line of Best Fit',
labels={'Price': 'Price', 'level': 'Level (Rarity Number)'})
# Add the line of best fit to the plot
fig.add_traces(px.line(df_pals_filtered, x='Price', y='Best Fit Line').data)
# Increase label size
fig.update_layout(
font=dict(size=label_size, color="black")
)
# Show the plot
fig.show()
# Print the linear regression results with more precision
print(f"Slope: {slope:.5f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared: {r_value**2:.4f}")
plot_price_vs_level_with_regression('all_pals_info.csv')
Slope: 0.00083
Intercept: 1.53
R-squared: 0.6866
This scatterplot illustrates the relationship between the price of Pals and their rarity level, excluding Legendary Pals. The X-axis represents the price, while the Y-axis represents the rarity level. The scatterplot reveals a positive linear relationship between a Pal's level and its price, indicating that as the rarity level of a Pal increases, the price tends to rise as well. The line of best fit confirms this trend, showing that price generally increases with level.
However, the R-squared value of 0.69 suggests a moderately strong correlation, meaning that approximately 69% of the variance in price can be explained by the Pal's level. This indicates that while level is an important factor, other elements likely contribute to price variation. Additionally, the scatterplot identifies outliers, indicating that some Pals' prices do not follow the typical pattern. These outliers suggest that factors beyond rarity level, such as unique abilities or features, may significantly influence price.
The overall distribution shows that while there is a clear upward trend, the price of Pals at any given level can vary widely. This variability highlights that level is not the sole determinant of price, particularly at lower levels where pricing is more diverse. As the rarity level increases, pricing becomes more consistent, reflecting a more predictable trend.
In conclusion, the scatterplot effectively demonstrates that as the rarity level of a Pal increases, so does its price. The presence of outliers and the moderate R-squared value indicate that while level is a significant factor, other characteristics also play crucial roles in determining the price.
prepare_data(df, x_feat_list)
Prepare the data by extracting features and adding a constant term.
fit_ols_model(X, y)
Fit an OLS regression model and return the model summary and parameters.
perform_kfold(X, y, n_splits=5, random_state = 42)
Perform K-Fold cross-validation and return R-squared and MSE values.
plot_residuals_vs_index(residuals)
Plot the index vs. the errors (residuals).
plot_features_vs_residuals(df, x_feat_list, residuals, valid_indices)
Plot each feature vs. the errors (residuals).
plot_qq(residuals)
Plot the normal probability quantile-quantile plot of the errors (residuals).
def prepare_data(df, x_feat_list):
""" Prepare the data by extracting features and adding a constant term.
Args:
df (DataFrame):
x_feat_list (list): list of pal features
Returns:
X (DataFrame): input values
y (series): output values
"""
X = df[x_feat_list]
X = sm.add_constant(X)
y = df['level']
return X, y
def fit_ols_model(X, y):
""" Fit an OLS regression model and return the model summary and parameters.
Args:
X (DataFrame): input values
        y (series): output values
Returns:
model.summary() (statsmodels): OLS Regression Results (including coef, std error,
and t values for the x features).
model.params (float): coefficient values for the (x) features from the model
"""
model = sm.OLS(y, X).fit()
return model.summary(), model.params
def perform_kfold(X, y, n_splits=5, random_state = 42):
""" Perform K-Fold cross-validation and return R-squared and MSE values.
Args:
X (DataFrame): input values
y (series): output values
Returns:
        r2_full (float): R-squared value
        mse_full (float): mean squared error value
        y_pred_full (np.array): predicted level values
        valid_indices (tuple): indices of rows with levels 1 to 10
"""
# Filter the data to only include levels from 1 to 10
valid_indices = np.where((y >= 1) & (y <= 10))
X = X[valid_indices]
y = y[valid_indices]
# Initialize KFold
kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
# Initialize array to store predictions
y_pred_full = np.empty_like(y)
# Loop through each split
for train_index, test_index in kf.split(X):
# Split the data into training and test sets
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# Fit the linear regression model
reg = LinearRegression()
reg.fit(X_train, y_train)
# Predict on the test set
y_pred = reg.predict(X_test)
y_pred_full[test_index] = y_pred
# Calculate R-squared and MSE values
r2_full = r2_score(y, y_pred_full)
mse_full = mean_squared_error(y, y_pred_full)
return r2_full, mse_full, y_pred_full, valid_indices
def plot_residuals_vs_index(residuals):
""" Plot the index vs. the errors (residuals).
Args:
residuals (np.arrays): errors
Returns:
None: scatterplot of index vs. the errors
"""
plt.figure(figsize=(10, 6))
plt.scatter(range(len(residuals)), residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Index')
plt.ylabel('Residuals')
plt.title('Index vs. Residuals')
plt.show()
def plot_features_vs_residuals(df, x_feat_list, residuals, valid_indices):
""" Plot each feature vs. the errors (residuals).
Args:
df (DataFrame): data frame of pals
x_feat_list (list): list of str containing pal features
residuals (np.array): the errors
        valid_indices (tuple): indices of rows with levels 1 to 10
    Returns:
        None: plot of the pal features vs errors
"""
# filters the data to only include pals with levels 1- 10
df_filtered = df.iloc[valid_indices]
num_features = len(x_feat_list)
num_rows = (num_features + 2) // 3 # Calculate rows needed for the given number of features (3 columns)
fig, axes = plt.subplots(nrows=num_rows, ncols=3, figsize=(15, num_rows * 5))
fig.suptitle('Feature vs. Residuals')
for i, feature in enumerate(x_feat_list):
row, col = divmod(i, 3)
sns.scatterplot(x=df_filtered[feature], y=residuals, ax=axes[row, col])
axes[row, col].axhline(y=0, color='r', linestyle='--')
axes[row, col].set_xlabel(feature)
axes[row, col].set_ylabel('Residuals')
plt.tight_layout(rect=[0, 0.06, 1, 0.95])
plt.show()
def plot_qq(residuals):
""" Plot the normal probability quantile-quantile plot of the errors (residuals).
Args:
residuals (np.array): the errors
Returns:
None: a normal probability quantile-quantile plot
"""
plt.figure(figsize=(10, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Normal Probability Q-Q Plot of Residuals')
plt.show()
# 1. Prepare the data
x_feat_list = ['HP', 'Defense', 'Melee Attack',
'Shot Attack', 'Stamina', 'Support', 'Running Speed',
'Sprinting Speed', 'Slow Walk Speed']
X, y = prepare_data(df_pal, x_feat_list)
# 2. Fit the OLS model and display results
model_summary, model_params = fit_ols_model(X, y)
#print(model_summary)
print(model_params)
# 3. Perform K-Fold cross-validation and get the results
r2_full, mse_full, y_pred_full, valid_indices = perform_kfold(X.values, y.values)
print(f"R-squared value: {r2_full:.3f}")
print(f"mean square error: {mse_full:.3f}")
const              5.707317
HP                 1.525470
Defense            0.802142
Melee Attack      -0.506768
Shot Attack        1.674647
Stamina           -0.161568
Support           -0.123691
Running Speed      0.480275
Sprinting Speed    0.296393
Slow Walk Speed    0.172998
dtype: float64
R-squared value: 0.782
mean square error: 1.701
# Calculate residuals
residuals = y.values[valid_indices] - y_pred_full
# 4. Plot index vs. residuals
plot_residuals_vs_index(residuals)
# 5. Plot each feature vs. residuals
plot_features_vs_residuals(df_pal, x_feat_list, residuals, valid_indices)
# 6. Plot the Q-Q plot of residuals
plot_qq(residuals)
Before running our regression models, we examined the coefficient table for each feature included in our x_feat_list. This allowed us to understand each feature's individual contribution to the target variable, which in this case is the level (rarity) of the Pals being analyzed.
Among the features, the coefficients for HP, defense, and shot attack stood out with values of 1.53, 0.80, and 1.67 respectively. The fact that these coefficients are all positive suggests that these features are positively correlated with level, meaning that as HP, defense, and shot attack increase, the level or rarity of a Pal also tends to increase.
Notably, these three features (HP, defense, and shot attack) have the greatest absolute values among all the coefficients, indicating they are the most influential predictors in the model. This insight is further supported by Visualization 1, which shows the mean values of each feature by level. The visualization reflects how these top features vary across different levels, providing a clear visual representation of their impact on the target variable. This alignment between the coefficient table and the visualization helps validate the model's findings and reinforces the importance of these features in predicting the level or rarity of the Pals.
For our multiple regression model, we employed a 5-fold cross-validation technique to assess its performance across different train-test splits. By running this cross-validation, we obtained full predictions for the target variable, y, across the entire dataset. When comparing these predicted values to the true labels, the model yielded an R-square value of 0.782. This value, being relatively close to 1, suggests a reasonably good fit, indicating that the model explains a substantial portion of the variance in the target variable.
In addition to the R-square value, we also evaluated the model using the Mean Squared Error (MSE), which came out to be 1.701. This MSE value is relatively low, further supporting the notion that the model has a good fit. However, to fully understand the model's performance, it would be insightful to compare this MSE with those from other regression models, as it provides a more contextual understanding of how well the model is performing.
Beyond evaluating the fit, it’s crucial to check whether the regression assumptions hold true, as these assumptions underpin the validity of the regression model’s results. First, we examined the independence of residuals, which states that the residuals (errors) should be independent of each other. Upon plotting the residuals, we observed that the points were mostly evenly and randomly distributed. However, one minor observation is that there seems to be a slight imbalance, with more points appearing above 0 than below. While this is not a major concern, it's something to be mindful of.
Next, we looked at the assumption of homoscedasticity, which requires that the residuals should have constant variance across all levels of the independent variables. To check this, we plotted residuals against each feature, resulting in nine plots. Most of these plots displayed a random and even distribution of residuals, which is a good sign. However, we noticed some outliers, particularly in the features of Running Speed and Sprinting Speed. These outliers suggest that for these specific features, the residuals may not have constant variance, indicating potential issues with heteroscedasticity.
Finally, we assessed the normality of residuals, which posits that the residuals should be normally distributed. We did this by comparing the distribution of residuals to a reference normal distribution line. Ideally, if the residuals were normally distributed, the points would closely follow or overlap with the red line. However, in our case, while the overall trend of the points is in the same direction as the red line, there is a noticeable deviation, indicating a departure from normality. This discrepancy arises because the target variable we are predicting is discrete rather than continuous. When we refer back to our dataset, we observe that the target variable is displayed with discrete numbers instead of continuous values, which violates the assumption that the dependent variable should be continuous in a regression model.
Given these observations, it’s essential to reconsider the model's assumptions and explore alternative approaches like classification modeling that better accommodate the nature of the data.
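The shape of such a classification approach can be sketched on synthetic stand-in data. This is not our real pipeline, just a minimal StratifiedKFold loop of the kind the helper functions below implement; the class shifts are artificial so the model has something to learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 120 samples, 4 stats, 3 rarity classes.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = np.repeat(["Common", "Rare", "Epic"], 40)
X[y == "Rare"] += 1.5   # shift the class means so the classes are separable
X[y == "Epic"] += 3.0

# Stratified folds keep the class proportions balanced in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = np.empty_like(y)
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = clf.predict(X[test_idx])

print(f"accuracy: {accuracy_score(y, y_pred):.3f}")
```

Collecting out-of-fold predictions for every row, as here, is what lets us build a single confusion matrix over the whole dataset afterward.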
get_acc_sens_spec(y_true, y_pred)
Computes accuracy, sensitivity, and specificity for binary inputs.
convert_to_list(y_val)
Converts a list of (predicted) rarity values to list format.
convert_to_binary(rarity_list, label)
Converts a list of rarity types to binary based on a label.
get_acc_sens_spec_each_rarity(y_true, y_pred)
Computes accuracy, sensitivity, and specificity for each rarity.
train_and_evaluate_classifier(x, y, model, n_splits=10)
Trains and evaluates a classifier using Stratified K-Fold cross-validation.
plot_conf_matrix(y_true, y_pred, labels, title='Confusion Matrix', cmap='Blues')
Plots the confusion matrix for the given true and predicted labels.
plot_decision_tree(model, feature_names, class_names, figsize=(15, 10))
Plots the decision tree for the provided model.
def convert_to_list(y_val):
    """Converts the array of (predicted) rarity values to proper list format.

    Args:
        y_val (np.array): predicted rarity values

    Returns:
        y_list (list): list of predicted rarity labels
    """
    # converts the array to list format
    y_arr = np.array(y_val)
    y_list_full = str(y_arr.tolist())
    y_list = re.findall('[A-Z][^A-Z]*', y_list_full)
    # cleans the elements
    for y in range(len(y_list) - 1):
        # removes the extra 4 characters ("', '") from each element
        y_list[y] = y_list[y][:-4]
    # removes the extra two characters ("']") from the last element
    y_list[-1] = y_list[-1][:-2]
    return y_list
def convert_to_binary(rarity_list, label):
    """Converts the list of rarity types to binary based on label.

    Args:
        rarity_list (list): list of rarity labels
        label (str): rarity label to mark as 1

    Returns:
        bin_list (list): list of binary values
    """
    return [1 if val == label else 0 for val in rarity_list]
def get_acc_sens_spec(y_true, y_pred):
    """Computes accuracy, sensitivity, & specificity (assumes binary inputs).

    Args:
        y_true (np.array): binary ground truth per trial
        y_pred (np.array): binary prediction per trial

    Returns:
        acc (float): accuracy
        sens (float): sensitivity
        spec (float): specificity
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=(0, 1)).ravel()
    # Compute sensitivity
    sens = tp / (tp + fn) if tp + fn else np.nan
    # Compute specificity
    spec = tn / (tn + fp) if tn + fp else np.nan
    # Compute accuracy
    acc = (tp + tn) / (tn + fp + fn + tp)
    return acc, sens, spec
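As a quick sanity check of the arithmetic above, here is a worked toy example (the binary vectors are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# tp=3, fn=1, tn=3, fp=1 for the vectors above
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=(0, 1)).ravel()
acc = (tp + tn) / (tn + fp + fn + tp)   # (3 + 3) / 8 = 0.75
sens = tp / (tp + fn)                   # 3 / 4 = 0.75
spec = tn / (tn + fp)                   # 3 / 4 = 0.75
```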
def get_acc_sens_spec_each_rarity(y_true, y_pred):
    """Computes accuracy, sensitivity, & specificity for each rarity.

    Args:
        y_true (np.array): true rarity
        y_pred (np.array): predicted rarity

    Returns:
        None: plots a confusion matrix and prints accuracy, sensitivity,
            and specificity for each rarity label
    """
    rarity_labels = ["Common", "Epic", "Legendary", "Rare"]
    for label in rarity_labels:
        # converts labels to binary
        y_true_bin = convert_to_binary(y_true, label)
        y_pred_bin = convert_to_binary(y_pred, label)
        # plots confusion matrix for each rarity label
        conf_matrix = confusion_matrix(y_true=y_true_bin, y_pred=y_pred_bin, labels=[0, 1])
        conf_mat_disp = ConfusionMatrixDisplay(conf_matrix, display_labels=["Not " + label, label])
        conf_mat_disp.plot()
        plt.title(f"Confusion Matrix for the {label} rarity")
        # gets acc, sens, & spec for each rarity label
        acc, sens, spec = get_acc_sens_spec(y_true_bin, y_pred_bin)
        print(f"For the {label} rarity:")
        print("Accuracy:", acc)
        print("Sensitivity:", sens)
        print("Specificity:", spec)
def train_and_evaluate_classifier(x, y, model, n_splits=10):
    """Trains and evaluates a classifier using Stratified K-Fold cross-validation.

    Args:
        x (np.array): predictor values (split into train & test folds)
        y (np.array): true target values
        model (sklearn estimator): the classifier to fit
        n_splits (int): number of cross-validation folds

    Returns:
        y_pred (np.array): out-of-fold predicted y values
    """
    kfold = StratifiedKFold(n_splits=n_splits)
    y_pred = np.empty_like(y)
    for train_idx, test_idx in kfold.split(x, y):
        x_train, y_train = x[train_idx], y[train_idx]
        x_test = x[test_idx]
        model.fit(x_train, y_train)
        y_pred[test_idx] = model.predict(x_test)
    return y_pred
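As an aside, sklearn ships this same out-of-fold pattern as cross_val_predict: every sample is predicted by a model that never saw it during training. A hedged sketch on synthetic data (not the Pal dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

x, y = make_classification(n_samples=120, n_classes=3, n_informative=4,
                           random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
# same idea as the manual loop above: out-of-fold predictions under
# stratified 10-fold cross-validation
y_pred = cross_val_predict(clf, x, y, cv=StratifiedKFold(n_splits=10))
```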
def plot_conf_matrix(y_true, y_pred, labels, title='Confusion Matrix', cmap='Blues'):
"""Plots the confusion matrix for the given true and predicted labels.
Args:
y_true (np.array): the true y values
y_pred (np.array): the predicted y values
labels (list): list of matrix labels (the unique values found in y_true)
title (str): title of matrix
cmap (str): sets color of matrix
Returns:
None: Displays the confusion matrix
"""
conf_matrix = confusion_matrix(y_true, y_pred)
con_disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=labels)
con_disp.plot(cmap=cmap)
plt.title(title)
plt.show()
def plot_decision_tree(model, feature_names, class_names, figsize=(15, 10)):
""" Plots the decision tree for the provided model.
Args:
model (model): The decision tree model to plot.
feature_names (list): List of feature names.
class_names (list): List of class names.
figsize (tuple): Size of the figure (default is (15, 10)).
Returns:
None: Displays the decision tree plot.
"""
plt.figure(figsize=figsize)
tree.plot_tree(model, feature_names=feature_names, class_names=class_names)
plt.show()
# Define features and labels
x_feat_list = ["HP", "Defense", "Crafting Speed", "Melee Attack",
"Shot Attack", "Stamina", "Support", "Running Speed",
"Sprinting Speed", "Slow Walk Speed"]
x = df_pal.loc[:, x_feat_list].values
y = df_pal.loc[:, "rarity"].values
# Initialize the Decision Tree classifier
dec_tree_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
# Train and evaluate the classifier
y_pred_tree = train_and_evaluate_classifier(x, y, dec_tree_clf)
# Plot the decision tree
plot_decision_tree(dec_tree_clf, feature_names=x_feat_list, class_names=dec_tree_clf.classes_.tolist())
# Plot confusion matrix and calculate accuracy
plot_conf_matrix(y, y_pred_tree, labels=dec_tree_clf.classes_.tolist(), title='Confusion Matrix for Decision Tree Classifier')
print("Overall Accuracy:", accuracy_score(y, y_pred_tree))
# Get accuracy, sensitivity, and specificity for each rarity
y_list = convert_to_list(y)
y_pred_list = convert_to_list(y_pred_tree)
get_acc_sens_spec_each_rarity(y_list, y_pred_list)
/Users/dannisjessy/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 7 members, which is less than n_splits=10.
Overall Accuracy: 0.6890243902439024
For the Common rarity:
Accuracy: 0.8658536585365854
Sensitivity: 0.8153846153846154
Specificity: 0.898989898989899
For the Epic rarity:
Accuracy: 0.7865853658536586
Sensitivity: 0.7027027027027027
Specificity: 0.8110236220472441
For the Legendary rarity:
Accuracy: 0.9878048780487805
Sensitivity: 0.7142857142857143
Specificity: 1.0
For the Rare rarity:
Accuracy: 0.7378048780487805
Sensitivity: 0.5272727272727272
Specificity: 0.8440366972477065
We implemented a decision tree with stratified K-fold cross-validation to predict rarity based on ten features: HP, Defense, Crafting Speed, Melee Attack, Shot Attack, Stamina, Support, Running Speed, Sprinting Speed, and Slow Walk Speed. The decision tree used max_depth=3 and random_state=42, and was evaluated with 10-fold cross-validation (n_splits=10).
While some of the leaves had Gini scores equal to or close to zero, others were noticeably higher, with impurity values of 0.375, 0.499, and 0.534. Based on the Gini scores of the leaves, the tree needs further tuning, since some impurity values are not as close to zero as we would like. Stratified K-fold was used to split the data into training and test folds, which gave us our first confusion matrix, with an accuracy of 0.6890. The matrix correctly predicts 53 common, 26 epic, 5 legendary, and 29 rare rarity values. While the matrix does a fairly decent job of predicting the common rarity, its ability to predict the epic and rare rarities is weaker in comparison. The matrix also does a fairly decent job of predicting the legendary rarity.
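For reference, the Gini impurity quoted on the leaves is 1 minus the sum of squared class proportions at that leaf; the quoted 0.375 corresponds to a 75/25 two-class leaf, and the other values depend on the exact class mix. A small sketch:

```python
def gini(proportions):
    """Gini impurity: 1 - sum of squared class proportions at a leaf."""
    return 1.0 - sum(p ** 2 for p in proportions)

gini([0.75, 0.25])   # -> 0.375, one of the leaf impurities quoted above
gini([1.0])          # -> 0.0, a perfectly pure leaf
```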
Because there are four predicted rarity values, the original get_acc_sens_spec function had to be modified so that it gave the accuracy, sensitivity, and specificity for each rarity individually. This was done by converting both the array of true rarity values and the array of predicted rarity values to lists using the convert_to_list function, which were then passed into the get_acc_sens_spec_each_rarity function. This function loops through each rarity label, converting the y_true and y_pred lists to binary, with 1 representing the rarity being evaluated and 0 representing everything else. For example, if the Common rarity is being evaluated, ['Epic', 'Common', 'Epic', 'Epic', 'Rare', 'Common', 'Epic', 'Legendary', 'Epic', 'Epic', 'Rare', 'Rare'] gets converted to [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. The function then creates a confusion matrix for each rarity, displaying the true positive, true negative, false positive, and false negative counts, and calls the get_acc_sens_spec function to get the accuracy, sensitivity, and specificity scores for each rarity.
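The per-rarity sensitivity computed by this loop can also be obtained in one call with sklearn's recall_score (recall is the same quantity as sensitivity). A small sketch with made-up labels:

```python
from sklearn.metrics import recall_score

y_true = ["Common", "Epic", "Common", "Rare", "Legendary", "Common", "Rare"]
y_pred = ["Common", "Epic", "Epic",   "Rare", "Legendary", "Common", "Epic"]

labels = ["Common", "Epic", "Legendary", "Rare"]
# one sensitivity (recall) per label, in the order of `labels`
per_class_sens = recall_score(y_true, y_pred, labels=labels, average=None)
# -> [2/3, 1.0, 1.0, 0.5] for the toy labels above
```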
For the common rarity, the accuracy score was found to be 0.8659, the sensitivity score was found to be 0.8154, and the specificity score was found to be 0.8990. For the epic rarity, the accuracy score was found to be 0.7866, the sensitivity score was found to be 0.7027, and the specificity score was found to be 0.8110. For the legendary rarity, the accuracy score was found to be 0.9878, the sensitivity score was found to be 0.7143, and the specificity score was found to be 1.0. For the rare rarity, the accuracy score was found to be 0.7378, the sensitivity score was found to be 0.5273, and the specificity score was found to be 0.8440. The individual rarity accuracy scores were all higher than the accuracy score of the entire matrix.
plot_feat_import(feat_list, feat_import, sort=True, limit=None)
Plots feature importances in a horizontal bar chart.
def plot_feat_import(feat_list, feat_import, sort=True, limit=None):
    """Plots feature importances in a horizontal bar chart.

    Args:
        feat_list (list): str names of features
        feat_import (np.array): feature importances (mean gini reduce)
        sort (bool): if True, sorts features in decreasing importance
            from top to bottom of plot
        limit (int): if passed, limits the number of features shown
            to this value

    Returns:
        None: barchart of important features
    """
    if sort:
        idx = np.argsort(feat_import).astype(int)
        feat_list = [feat_list[_idx] for _idx in idx]
        feat_import = feat_import[idx]
    if limit is not None:
        feat_list = feat_list[-limit:]
        feat_import = feat_import[-limit:]
    plt.barh(feat_list, feat_import)
    plt.gcf().set_size_inches(5, len(feat_list) / 2)
    plt.xlabel('Feature importance\n(Mean decrease in Gini across all Decision Trees)')
    plt.show()
# Initialize the Random Forest classifier
max_depth = 3
rf_clf = RandomForestClassifier(max_depth = max_depth, n_estimators = 500, random_state = 42)
# Train and evaluate the classifier
#y_pred_rf = train_and_evaluate_classifier(x, y, rf_clf)
rf_clf.fit(x,y)
y_pred_rf = rf_clf.predict(x)
y_pred_rf
# Plot confusion matrix and calculate accuracy
plot_conf_matrix(y, y_pred_rf, labels=rf_clf.classes_, title='Confusion Matrix for Random Forest Classifier')
print("Overall Accuracy:", accuracy_score(y, y_pred_rf))
# Get accuracy, sensitivity, and specificity for each rarity
y_pred_rf_list = convert_to_list(y_pred_rf)
get_acc_sens_spec_each_rarity(y_list, y_pred_rf_list)
Overall Accuracy: 0.8597560975609756
For the Common rarity:
Accuracy: 0.9390243902439024
Sensitivity: 0.8769230769230769
Specificity: 0.9797979797979798
For the Epic rarity:
Accuracy: 0.9207317073170732
Sensitivity: 0.8378378378378378
Specificity: 0.9448818897637795
For the Legendary rarity:
Accuracy: 0.9817073170731707
Sensitivity: 0.5714285714285714
Specificity: 1.0
For the Rare rarity:
Accuracy: 0.8780487804878049
Sensitivity: 0.8909090909090909
Specificity: 0.8715596330275229
# Plot feature importances
plot_feat_import(x_feat_list, rf_clf.feature_importances_, limit=8)
# Define features and labels
x_feat_list_5 = ["HP", "Defense", "Shot Attack", "Slow Walk Speed", "Sprinting Speed"]
x_5 = df_pal.loc[:, x_feat_list_5].values
# Initialize the Decision Tree classifier
dec_tree_clf_5 = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
# Train and evaluate the classifier
y_pred_tree_5_feat = train_and_evaluate_classifier(x_5, y, dec_tree_clf_5)
# Plot the decision tree
plt.figure(figsize=(15, 10))
tree.plot_tree(dec_tree_clf_5, feature_names=x_feat_list_5, class_names=dec_tree_clf_5.classes_.tolist())
plt.show()
# Plot confusion matrix and calculate accuracy
plot_conf_matrix(y, y_pred_tree_5_feat, labels=dec_tree_clf_5.classes_, title='Confusion Matrix for Decision Tree Classifier (Top 5 Features)')
print("Overall Accuracy:", accuracy_score(y, y_pred_tree_5_feat))
# Get accuracy, sensitivity, and specificity for each rarity
y_pred_tree_5_feat_list = convert_to_list(y_pred_tree_5_feat)
get_acc_sens_spec_each_rarity(y_list, y_pred_tree_5_feat_list)
/Users/dannisjessy/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 7 members, which is less than n_splits=10.
Overall Accuracy: 0.6951219512195121
For the Common rarity:
Accuracy: 0.8719512195121951
Sensitivity: 0.8153846153846154
Specificity: 0.9090909090909091
For the Epic rarity:
Accuracy: 0.7926829268292683
Sensitivity: 0.7297297297297297
Specificity: 0.8110236220472441
For the Legendary rarity:
Accuracy: 0.9939024390243902
Sensitivity: 0.8571428571428571
Specificity: 1.0
For the Rare rarity:
Accuracy: 0.7317073170731707
Sensitivity: 0.509090909090909
Specificity: 0.8440366972477065
# Initialize the Decision Tree classifier for max_depth = 4
dec_tree_clf_5 = tree.DecisionTreeClassifier(max_depth=4, random_state=42)
# Train and evaluate the classifier
y_pred_tree_5_feat = train_and_evaluate_classifier(x_5, y, dec_tree_clf_5)
# Plot the decision tree
plt.figure(figsize=(15, 10))
tree.plot_tree(dec_tree_clf_5, feature_names=x_feat_list_5, class_names=dec_tree_clf_5.classes_.tolist())
plt.show()
# Plot confusion matrix and calculate accuracy
plot_conf_matrix(y, y_pred_tree_5_feat, labels=dec_tree_clf_5.classes_, title='Confusion Matrix for Decision Tree Classifier (Top 5 Features)')
print("Overall Accuracy:", accuracy_score(y, y_pred_tree_5_feat))
# Get accuracy, sensitivity, and specificity for each rarity
y_pred_tree_5_feat_list = convert_to_list(y_pred_tree_5_feat)
get_acc_sens_spec_each_rarity(y_list, y_pred_tree_5_feat_list)
/Users/dannisjessy/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 7 members, which is less than n_splits=10.
Overall Accuracy: 0.725609756097561
For the Common rarity:
Accuracy: 0.8841463414634146
Sensitivity: 0.8
Specificity: 0.9393939393939394
For the Epic rarity:
Accuracy: 0.8292682926829268
Sensitivity: 0.5945945945945946
Specificity: 0.8976377952755905
For the Legendary rarity:
Accuracy: 0.9939024390243902
Sensitivity: 0.8571428571428571
Specificity: 1.0
For the Rare rarity:
Accuracy: 0.7439024390243902
Sensitivity: 0.7090909090909091
Specificity: 0.7614678899082569
To improve on this, a Random Forest classifier was then used to predict rarity from the same ten features (HP, Defense, Crafting Speed, Melee Attack, Shot Attack, Stamina, Support, Running Speed, Sprinting Speed, and Slow Walk Speed), with max_depth=3, random_state=42, and n_estimators=500. Note that, unlike the decision tree, the random forest was fit and evaluated on the full dataset rather than cross-validated, so its scores partly reflect training performance and are likely optimistic. Using the random forest's predictions, another confusion matrix was created, with an accuracy of 0.8598. The matrix correctly predicts 57 common, 31 epic, 4 legendary, and 49 rare rarity values. This confusion matrix does a much better job of predicting the common, rare, and epic rarity types than the previous matrix. The drawback is that it does not predict the legendary rarity as well as the previous matrix does.
The individual accuracy, sensitivity, and specificity scores were then found using the convert_to_list and get_acc_sens_spec_each_rarity functions as well as initial confusion matrices for each of the rarity types. For the common rarity, the accuracy score was found to be 0.9390, the sensitivity score was found to be 0.8769, and the specificity score was found to be 0.9798. For the epic rarity, the accuracy score was found to be 0.9207, the sensitivity score was found to be 0.8378, and the specificity score was found to be 0.9449. For the legendary rarity, the accuracy score was found to be 0.9817, the sensitivity score was found to be 0.5714, and the specificity score was found to be 1.0. For the rare rarity, the accuracy score was found to be 0.878, the sensitivity score was found to be 0.8909, and the specificity score was found to be 0.8716.
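Because the random forest above was fit and scored on the same rows (the cross-validation call was commented out in the code), an out-of-fold estimate would be more directly comparable to the decision tree's score. A hedged sketch on synthetic data, not the Pal dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

x, y = make_classification(n_samples=150, n_classes=3, n_informative=5,
                           random_state=42)
rf = RandomForestClassifier(max_depth=3, n_estimators=500, random_state=42)
# every prediction comes from a fold the model did not train on
y_oof = cross_val_predict(rf, x, y, cv=StratifiedKFold(n_splits=10))
oof_acc = accuracy_score(y, y_oof)   # comparable to the tree's CV accuracy
```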
To identify the top 5 features that are most important when it comes to predicting rarity, the plot_feat_import function was used to create a horizontal bar plot that shows the importance score of each x feature. The top 5 features that were found to be most important (from most to least) are HP, Shot Attack, Defense, Slow Walk Speed, and Sprinting Speed. HP was found to be most important, having a score of 0.31, with Shot Attack right behind with a score of 0.27. Defense had an importance score of about 0.19 and Slow Walk Speed and Sprinting Speed had importance scores of about 0.065.
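Impurity-based importances like rf_clf.feature_importances_ can be biased toward features with many distinct values, so permutation importance is a common cross-check. A sketch on synthetic data (not the Pal features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

x, y = make_classification(n_samples=150, n_features=6, n_informative=3,
                           random_state=42)
rf = RandomForestClassifier(max_depth=3, n_estimators=100,
                            random_state=42).fit(x, y)
# shuffle each feature in turn and measure how much accuracy drops
result = permutation_importance(rf, x, y, n_repeats=10, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]   # most important first
```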
Using only these top 5 features, the decision tree and confusion matrix from Model 2 were recreated with max_depth=3, 10-fold cross-validation (n_splits=10), and random_state=42. This decision tree was found to be identical to the original tree from Model 2, which supports the conclusion that Shot Attack, HP, Defense, Slow Walk Speed, and Sprinting Speed are the key features.
The accuracy of this model slightly improved to 0.6951, compared to 0.6890 from Model 2. The matrix correctly predicts 53 common, 27 epic, 6 legendary, and 28 rare rarity values. While the matrix's ability to predict the common, epic, and legendary rarities improved, the same cannot be said for the rare rarity. Hence there are trade-offs between using the top 5 features and using all features to predict rarity.
The individual accuracy, sensitivity, and specificity scores were then found using the convert_to_list and get_acc_sens_spec_each_rarity functions as well as initial confusion matrices for each of the rarity types. For the common rarity, the accuracy score was found to be 0.872, the sensitivity score was found to be 0.8154, and the specificity score was found to be 0.9091. For the epic rarity, the accuracy score was found to be 0.7927, the sensitivity score was found to be 0.7297, and the specificity score was found to be 0.8110. For the legendary rarity, the accuracy score was found to be 0.9939, the sensitivity score was found to be 0.8571, and the specificity score was found to be 1.0. For the rare rarity, the accuracy score was found to be 0.7317, the sensitivity score was found to be 0.5091, and the specificity score was found to be 0.844.
The final decision tree, like the previous one, used the top five features, 10-fold cross-validation (n_splits=10), and random_state=42, with one difference: max_depth was set to 4. Stratified K-fold was again used to generate the confusion matrix, which had an accuracy of 0.7256, a slight improvement over the previous one. While this tree is harder to read, the Gini scores on its leaves have decreased compared to the two previous trees, telling us that this tree has lower impurity than its predecessors and is less underfit. The matrix correctly predicts 52 common, 22 epic, 6 legendary, and 39 rare rarity values. Compared to the previous matrix, its ability to predict the rare rarity has improved, legendary has stayed the same, and common and epic have decreased.
The individual accuracy, sensitivity, and specificity scores were then found using the convert_to_list and get_acc_sens_spec_each_rarity functions as well as individual confusion matrices for each of the rarity types. For the common rarity, the accuracy score was found to be 0.8841, the sensitivity score was found to be 0.8, and the specificity score was found to be 0.9394. For the epic rarity, the accuracy score was found to be 0.8293, the sensitivity score was found to be 0.5946, and the specificity score was found to be 0.8976. For the legendary rarity, the accuracy score was found to be 0.9939, the sensitivity score was found to be 0.8571, and the specificity score was found to be 1.0. For the rare rarity, the accuracy score was found to be 0.7439, the sensitivity score was found to be 0.7091, and the specificity score was found to be 0.7615.
Inspecting how the individual confusion matrices change between max_depth=3 and max_depth=4 reveals several interesting patterns. For the common and epic matrices, increasing max_depth from 3 to 4 increased false negatives and decreased true positives, while false positives decreased and true negatives increased (sensitivity fell while specificity rose). The opposite occurred for the rare matrix: false negatives decreased and true positives increased, while false positives increased and true negatives decreased.
All of the individual rarity accuracy scores are better than the overall accuracy of their respective matrices. After comparing all of the individual accuracy, sensitivity, and specificity scores, we observed that the random forest model produced the best accuracy, sensitivity, and specificity scores for the common, rare, and epic rarity types. A likely reason the legendary rarity always had a specificity score of 1 is that there are only 7 legendary Pals in the dataset, a very small sample. The same small sample size may explain why the top-5-feature decision trees with max_depth of 3 and max_depth of 4 had the same legendary accuracy and sensitivity scores.
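One way to address the legendary imbalance noted here is class weighting. In the sketch below, the class counts mirror those implied by the sensitivities above (65 Common, 55 Rare, 37 Epic, 7 Legendary, 164 total), but the feature values are synthetic; class_weight="balanced" is a standard sklearn option, not something the project used.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# class counts implied by the report's per-rarity sensitivities (164 Pals)
y = np.array(["Common"] * 65 + ["Rare"] * 55 + ["Epic"] * 37 + ["Legendary"] * 7)
x = rng.normal(size=(len(y), 5))
x[y == "Legendary"] += 2.0    # synthetic signal so the tree has something to find

# class_weight='balanced' reweights samples inversely to class frequency,
# so the 7 Legendary rows matter as much in the splits as the 65 Common rows
clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced",
                             random_state=42).fit(x, y)
```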
After comparing all of the individual confusion matrices for each rarity type, we observed that the random forest classifier produced the best ones for the common, rare, and epic rarity types since these three individual matrices have the lowest false positive and false negative scores for each rarity type. The decision tree for the top 5 features (both with a max_depth of 3 and max_depth of 4) resulted in the best individual confusion matrix for the legendary rarity type. This matrix has a false positive score of 1 and a false negative score of 0.
The findings from this study underscore the importance of selecting the right features and models for accurately predicting the rarity of Pals in Palworld. We made the strategic decision to exclude legendary Pals (level 20) from Visualization 3 and the multiple regression model, on the understanding that including such a small and extreme category could skew the results and reduce the overall accuracy and predictive power of our models. We also chose to exclude price as a feature, since Visualization 3 suggests that price is more likely a consequence of rarity than a determinant of it. We successfully answered our research question: HP, Shot Attack, and Defense emerged as the most critical factors influencing rarity predictions, as suggested and validated by the trends in the feature-mean-by-level visualization, the coefficient table for the multiple regression model, and the Random Forest feature importance plot. Although the multiple regression model did not ultimately prove effective due to the discrete nature of the level variable, it nonetheless provided valuable insights into our research question. Between the two classification models, the Random Forest outperformed the others in predicting common, rare, and epic rarities, while the Decision Tree was most effective for the legendary category, likely due to the small sample size of legendary Pals. These results suggest that ensemble methods like Random Forest are better equipped to capture the complexities of the data, particularly when dealing with multiple, interrelated features.
While the Random Forest model showed strong performance, particularly for the common, rare, and epic rarities, there is room for improvement through further model tuning and handling of class imbalance, especially for the underrepresented legendary category. Additionally, exploring feature engineering, such as creating new features or examining interactions between existing ones, could enhance the models' predictive accuracy. Our results also point to an inherent trade-off between accuracy, sensitivity, and specificity, highlighting the need to select the analysis method best suited to the specific context and objectives.
Looking ahead, besides finding the max_depth for the top-five-feature decision tree that gives the best balance among accuracy, sensitivity, and specificity, it would be beneficial to investigate alternative classification algorithms like Gradient Boosting Machines (GBM) or Support Vector Machines (SVM) to see if they offer performance gains. Finally, implementing these models in a real-time system within the game could enable dynamic predictions and adjustments as new data is collected, further refining the accuracy and applicability of these models in practical settings.
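The follow-up models mentioned above could be evaluated with exactly the same out-of-fold protocol as the decision tree. A hedged sketch on synthetic data (model settings are sklearn defaults, not tuned choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = make_classification(n_samples=150, n_classes=3, n_informative=5,
                           random_state=42)
cv = StratifiedKFold(n_splits=10)
scores = {}
for name, model in [("GBM", GradientBoostingClassifier(random_state=42)),
                    ("SVM", make_pipeline(StandardScaler(), SVC(random_state=42)))]:
    # SVMs are scale-sensitive, hence the StandardScaler in the pipeline
    y_pred = cross_val_predict(model, x, y, cv=cv)
    scores[name] = accuracy_score(y, y_pred)
```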
Our team has divided the workload with Brey handling Visualization 2 and 3, along with their interpretations, and writing the Introduction and Data sections. Allie is responsible for Model 2 and 3, including all related interpretations. Enxi is tasked with Visualization 1, Model 1, and their interpretations, as well as writing the Discussion section and overseeing the overall organization and compilation of the report and project. We supported each other throughout the project, stepping in to assist whenever someone needed help in different areas.
We were particularly impressed by Group 3's presentation. Like our project, they focused on rarity and other related features. Their use of a heat map to display the coefficients of each feature in relation to rarity was especially helpful in visualizing the importance of different attributes. They also utilized a Random Forest model, which aligned with our approach. However, we believe it would be beneficial for them to implement a Decision Tree model on the top features identified by their analysis, as this might provide additional insights into the key drivers of rarity.
Another group that stood out was the last group in the presentation. However, we found their approach somewhat confusing, particularly their model and plot for level and rarity. Since level is essentially a numerical representation of rarity (with levels 1-4 corresponding to common, 5-7 to rare, 8-10 to epic, and 20 to legendary), predicting rarity from level seems redundant: rarity is already clearly defined in the raw data, so predicting it from level adds little value.